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Abstract 


The origin and early spread of SARS-CoV-2 remains shrouded in mystery. Here, | identify a data set containing SARS-CoV- 
2 sequences from early in the Wuhan epidemic that has been deleted from the NIH’s Sequence Read Archive. | recover 
the deleted files from the Google Cloud and reconstruct partial sequences of 13 early epidemic viruses. Phylogenetic 
analysis of these sequences in the context of carefully annotated existing data further supports the idea that the Huanan 
Seafood Market sequences are not fully representative of the viruses in Wuhan early in the epidemic. Instead, the 
progenitor of currently known SARS-CoV-2 sequences likely contained three mutations relative to the market viruses 


that made it more similar to SARS-CoV-2’s bat coronavirus relatives. 
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Understanding the spread of SARS-CoV-2 in Wuhan is crucial 
to tracing the origins of the virus, including identifying events 
that led to infection of patient zero. The first reports outside 
of China at the end of December 2019 emphasized the role of 
the Huanan Seafood Market (ProMED 2019), which was ini- 
tially suggested as a site of zoonosis. However, this theory 
became increasingly tenuous as it was learned that many early 
cases had no connection to the market (Cohen 2020; Huang 
et al. 2020; Chen et al. 2020). Eventually, Chinese CDC 
Director Gao Fu dismissed the theory, stating “At first, we 
assumed the seafood market might have the virus, but now 
the market is more like a victim. The novel coronavirus had 
existed long before” (Global Times 2020). 

Indeed, there were reports of cases that far preceded the 
outbreak at the Huanan Seafood Market. The Lancet de- 
scribed a confirmed case having no association with the mar- 
ket whose symptoms began on December 1, 2019 (Huang 
et al. 2020). The South China Morning Post described nine 
cases from November 2019 including details on patient age 
and sex, noting that none were confirmed to be “patient zero” 
(Ma 2020). Professor Yu Chuanhua of Wuhan University told 
the Health Times that records he reviewed showed two cases 
in mid-November, and one suspected case on September 29 
(Health Times 2020). At about the same time as Professor 
Chuanhua’s interview, the Chinese CDC issued an order for- 
bidding sharing of information about the COVID-19 epidemic 
without approval (China CDC 2020), and shortly thereafter 
Professor Chuanhua recontacted the Health Times to say that 
the November cases could not be confirmed (Health Times 
2020). Then, China’s State Council issued a much broader 
order requiring central approval of all publications related 
to COVID-19 to ensure they were coordinated “like moves 
in a game of chess” (Kang, Cheng, et al. 2020). In 2021, the 


joint WHO-China report dismissed all reported cases prior to 
December 8 as not COVID-19, reviving the theory that the 
virus might have originated at the Huanan Seafood Market 
(WHO 2021). 

In other outbreaks where direct identification of early cases 
has been stymied, it has increasingly become possible to use 
genomic epidemiology to infer the timing and dynamics of 
spread from the analysis of viral sequences. For instance, anal- 
ysis of SARS-CoV-2 sequences has enabled the reconstruction 
of the initial spread of SARS-CoV-2 in North America and 
Europe (Bedford et al. 2020; Worobey et al. 2020; Deng et al. 
2020; Fauver et al. 2020). 

But in the case of Wuhan, genomic epidemiology has also 
proven frustratingly inconclusive. Some of the problem is 
simply limited data: despite the fact that Wuhan has ad- 
vanced virology labs, there is only patchy sampling of SARS- 
CoV-2 sequences from the first months of the city’s explosive 
outbreak. Other than a set of multiply sequenced samples 
collected in late December of 2019 from a dozen patients 
connected to the Huanan Seafood Market (WHO 2021), just 
a handful of Wuhan sequences are available from before late 
January of 2020 (see analysis in this study below). This paucity 
of sequences could be due in part to an order that unauthor- 
ized Chinese labs destroy all coronavirus samples from early in 
the outbreak, reportedly for “laboratory biological safety” rea- 
sons (Pingui 2020). 

However, the Wuhan sequences that are available have 
also confounded phylogenetic analyses designed to infer 
the “progenitor” of SARS-CoV-2, which is the sequence 
from which all other currently known sequences are 
descended (Kumar et al. 2021). Although there is debate 
about exactly how SARS-CoV-2 entered the human popula- 
tion, it is universally accepted that the virus’s deep ancestors 
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are bat coronaviruses (Lytras et al. 2021). But the earliest 
known SARS-CoV-2 sequences, which are mostly derived 
from the Huanan Seafood Market, are notably more different 
from these bat coronaviruses than other sequences collected 
at later dates outside Wuhan. As a result, there is a direct 
conflict between the two major principles used to infer an 
outbreak’s progenitor: namely that it should be among the 
earliest sequences and that it should be most closely related 
to deeper ancestors (Pipes et al. 2021). 

Here, | take a small step toward resolving these questions 
by identifying and recovering a deleted data set of partial 
SARS-CoV-2 sequences from outpatient samples collected 
early in the Wuhan epidemic. Analysis of these new sequen- 
ces in conjunction with careful annotation of existing ones 
suggests that the early Wuhan samples that have been the 
focus of most studies including the joint WHO-China report 
(WHO 2021) are not fully representative of the viruses actu- 
ally present in Wuhan at that time. These insights help rec- 
oncile phylogenetic discrepancies and suggest two plausible 
progenitor sequences, one of which is identical to that in- 
ferred by Kumar et al. (2021) and the other of which contains 
the C29095T mutation. Furthermore, the approach taken 
here hints it may be possible to advance the understanding 
of SARS-CoV-2’s origins or early spread even without further 
on-the-ground studies, such as by more deeply probing data 
archived by the NIH and other entities. 


Results 


Identification of a SARS-CoV-2 Deep Sequencing Data 
Set That Has Been Removed from the Sequence Read 
Archive 

During the course of my research, | read a paper by Farkas 
et al. (2020) that analyzed SARS-CoV-2 deep sequencing data 
from the Sequence Read Archive (SRA), which is a repository 
maintained by the NIH’s National Center for Biotechnology 
Information. The first supplementary table of Farkas et al. 
(2020) lists all SARS-CoV-2 deep sequencing data available 
from the SRA as of March 30, 2020. 

The majority of entries in this table refer to a project 
(BioProject PRINA612766) by Wuhan University that is de- 
scribed as nanopore sequencing of SARS-CoV-2 amplicons. 
The table indicates that this project represents 241 of the 282 
SARS-CoV-2 sequencing run accessions in the SRA as of 
March 30, 2020. Because | had never encountered any other 
mention of this project, | performed a Google search for 
“PRJNAG12766” and found no search hits other than the 
supplementary table itself. Searching for “PRINA612766” in 
the NCBI’s SRA search box returned a message of “No items 
found.” | then searched for individual sequencing run acces- 
sions from the project in the NCBI’s SRA search box. These 
searches returned messages indicating that the sequencing 
runs had been removed (fig. 1). 

The SRA is designed as a permanent archive of deep se- 
quencing data. The SRA documentation states that after a 
sequencing run is uploaded, “neither its files can be replaced 
nor filenames can be changed” and that data can only be 
deleted by e-mailing SRA staff (SRA 2021). 
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The Deleted Data Set Contains Sequencing of Viral 
Samples Collected Early in the Wuhan Epidemic 

The metadata in the first supplementary table of Farkas et al. 
(2020) indicates that the samples in deleted project 
PRNJAG12766 were collected by Aisu Fu and Renmin 
Hospital of Wuhan University. Google searching for these 
terms revealed that the samples were related to a study 
posted as a preprint on medRxiv in early March of 2020 
(Wang et al. 2020) and subsequently published in the journal 
Small in June of 2020 (Wang et al. 2020). 

The study describes an approach to diagnose infection 
with SARS-CoV-2 and other respiratory viruses by nanopore 
sequencing. This approach involved reverse transcription of 
total RNA from swab samples, followed by PCR with specific 
primers to generate amplicons covering portions of the viral 
genome. These amplicons were then sequenced on an Oxford 
Nanopore GridION, and infection was diagnosed if the se- 
quencing yielded sufficient reads aligning to the viral genome. 
Importantly, the study notes that this approach yields infor- 
mation about the sequence of the virus as well enabling di- 
agnosis of infection. In fact, Wang et al. (2020) even list the 
mutations determined from this sequencing—but because 
this paper was published in the chemistry journal Small, after 
the sequences were removed from the SRA, their existence 
appears to have been entirely overlooked. 

The preprint (Wang et al. 2020) says the approach was 
applied to “45 nasopharyngeal swab samples from outpa- 
tients with suspected COVID-19 early in the epidemic.” The 
digital object identifier for the preprint indicates that it was 
processed by medRxiv on March 4, 2020, which is 1 day after 
China’s State Council ordered that all papers related to 
COVID-19 must be centrally approved (Kang, Cheng, et al. 
2020). The final published manuscript (Wang et al. 2020) 
from June of 2020 updated the description from “early in 
the epidemic” to “early in the epidemic (January 2020).” 
Both the preprint and published manuscript say that 34 of 
the 45 early epidemic samples were positive in the 
sequencing-based diagnostic approach. In addition, both 
state that the approach was later applied to 16 additional 
samples collected on February 11 and 12, 2020, from SARS- 
CoV-2 patients hospitalized at the Renmin Hospital of 
Wuhan University. 

There is complete concordance between the accessions for 
project PRINAG12766 in the supplementary table of Farkas 
et al. (2020) and the samples described by Wang et al. (2020). 
There are 89 accessions corresponding to the 45 early epi- 
demic samples, with these samples named like wells in a 96- 
well plate (A1, A2, etc.). The number of accessions is approx- 
imately twice the number of early epidemic samples because 
each sample has data for two sequencing runtimes except 
one sample (B5) with just one runtime. There are 31 acces- 
sions corresponding to the 16 samples collected in February 
from Renmin Hospital patients, with these samples named 
R01, R02, etc. Again, all but one sample (R04) have data for 
two sequencing runtimes. In addition, there are 7 accessions 
corresponding to positive and negative controls, 2 accessions 
corresponding to other respiratory virus samples, and 112 
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Fic. 1. Accessions from deep sequencing project PRJNA612766 have been removed from the SRA. Shown is the result of searching for 
“$RR11313485” in the SRA search toolbar. This result has been digitally archived on the Wayback Machine at https://web.archive.org/web/ 
20210502131630/ https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR11313485. 


samples corresponding to plasmids used for benchmarking of 
the approach. Together, these samples and controls account 
for all 241 accessions listed for PRJNA612766 in the supple- 
mentary table of Farkas et al. (2020). 

Neither the preprint (Wang et al. 2020) nor published 
manuscript (Wang et al. 2020) contains any correction or 
note that suggests a scientific reason for deleting the study's 
sequencing data from the SRA. | e-mailed both corresponding 
authors of Wang et al. (2020) to ask why they had deleted the 
deep sequencing data and to request details on the collection 
dates of the early outpatient samples, but received no reply. 


Recovery of Deleted Sequencing Data from the Google 
Cloud 

As indicated in fig. 1, none of the deleted sequencing runs 
could be accessed through the SRA’s web interface. In addi- 
tion, none of the runs could be accessed using the command- 
line tools of the SRA Toolkit. For instance, running fastq- 
dump SRR11313485 or vdb-dump SRR11313485 returned 
the message “err: query unauthorized while resolving query 
within virtual file system module—failed to resolve accession 
‘SRR11313485.” 

However, the SRA has begun storing all data on the 
Google and Amazon clouds. While inspecting the SRA’s 
web interface for other sequencing accessions, | noticed 
that SRA files are often available from links to the cloud 
such as https://storage.googleapis.com/nih-sequence-read-ar- 
chive/run/ACCESSION/ACCESSION. 


Based on the hypothesis that deletion of sequencing 
runs by the SRA might not remove files stored on the 
cloud, | interpolated the cloud URLs for the deleted 
accessions and tested if they still yielded the SRA files. This 
strategy was successful; for instance, as of June 3, 2021, 
going to https://storage.googleapis.com/nih-sequence-read- 
archive/run/SRR11313485/SRR11313485 downloads the 
SRA file for accession SRR11313485. | have archived this 
file on the Wayback Machine at https://web.archive.org/ 
web/20210502130820/https://storage.googleapis.com/nih-se- 
quence-read-archive/run/SRR11313485/SRR11313485. 

| automated this strategy to download the SRA files for 97 
of the 99 sequencing runs corresponding to the 34 SARS- 
CoV-2-positive early epidemic samples and the 16 hospital 
samples from February. The SRA files for two runs 
(SRR11313490 and SRR11313499) were not accessible via 
the Google Cloud, but after | posted the first version of this 
manuscript as a preprint, several individuals found archived 
data for these runs that had been downloaded when they 
were still available on the SRA, and I have used those data in 
the updated analysis described here (see Methods for details). 
| used the SRA Toolkit to get the object timestamp (vdb- 
dump —obj_timestamp) and time (vdb-dump -info) for all 
SRA files. For all files, the object timestamp is February 15, 
2020, and the time is March 16, 2020. Although the SRA 
Toolkit does not clearly document these two properties, 
my guess is that the object timestamp may refer to when 
the SRA file was created from a FASTQ file uploaded to the 
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SRA, and the time may refer to when the accession was made 
public. 


The Data Are Sufficient to Determine the Viral 
Sequence from the Start of Spike through the End of 
ORF10 for Some Samples 
Wang et al. (2020) sequenced PCR amplicons covering nu- 
cleotide sites 21,563-29,674 of the SARS-CoV-2 genome, 
which spans from the start of the spike gene to the end of 
ORF10. They also sequenced a short amplicon generated by 
nested PCR that covered a fragment of ORF1ab spanning sites 
~15,080-15,550. In this paper, | only analyze the region from 
spike through ORF10 because this is a much longer contigu- 
ous sequence and the amplicons were generated by conven- 
tional rather than nested PCR. | slightly trimmed the region of 
interest to 21,570-29,550 because many samples had poor 
coverage at the termini. 

| aligned the recovered deep sequencing data to the SARS- 
CoV-2 genome using minimap2 (Li 2018), combining acces- 
sions for the same sample, and masking regions that corre- 
sponded to the primer binding sites described in Wang et al. 
(2020). Supplementary fig. S1, Supplementary Material online, 
shows the sequencing coverage for the 34 virus-positive early 
epidemic samples and the 16 hospitalized patient samples 
over the region of interest; a comparable plot for the whole 
genome is in supplementary fig. S2, Supplementary Material 
online. 

| called the consensus viral sequence for each sample at 
each site with coverage >3% and >80% of the reads con- 
curring on the nucleotide identity. With these criteria, 13 of 
the early outpatient samples and 1 of the February hospi- 
talized patient samples had sufficient coverage to call the 
consensus sequence at >90% of the sites in the region of 
interest (table 1), and for the remainder of this paper, | focus 
on these high-coverage samples. Table 1 also shows the 
mutations in each sample relative to proCoV2, which is a 
putative progenitor of SARS-CoV-2 inferred by Kumar et al. 
(2021) that differs from the widely used Wuhan-Hu-1 ref- 
erence sequence by three mutations (C8782T, C18060T, 
and T28144C). Although requiring coverage of only >3 is 
relatively lenient, table 1 shows that all sites with mutations 
have coverage > 10. In addition, the mutations | called from 
the raw sequence data in table 1 concord with those men- 
tioned in Wang et al. (2020). Again, this fact emphasizes 
that the information contained in the deleted sequencing 
data is largely present in Wang et al. (2020), but because 
it was only published in a table in the chemistry journal 
Small rather than placed on the SRA, its existence was 
overlooked. 

| also determined the consensus sequence of the plasmid 
control used by Wang et al. (2020) from the recovered se- 
quencing data and found that it had mutations C28144T and 
G28085T relative to proCoV2, which means that in the region 
of interest this control matches Wuhan-Hu-1 with the addi- 
tion of G28085T. Since none of the viral samples in table 1 
contain G28085T and the samples that prove most relevant 
below also lack C28144T (which is a frequent natural 
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mutation among early Wuhan sequences), plasmid contam- 
ination of the viral samples is unlikely. 


Analysis of Existing SARS-CoV-2 Sequences 
Emphasizes the Perplexing Discordance between 
Collection Date and Distance to Bat Coronavirus 
Relatives 

To contextualize the viral sequences recovered from the de- 
leted project, | first analyze early SARS-CoV-2 sequences al- 
ready available in the GISAID database (Shu and McCauley 
2017). The analyses described in this section are not entirely 
novel but synthesize observations from multiple prior studies 
(Rambaut et al. 2020; Forster et al. 2020; Rambaut et al. 2020; 
Kumar et al. 2021; Pekar et al. 2021; Pipes et al. 2021) to 
provide key background. 

Known human SARS-CoV-2 sequences are consistent with 
expansion from a single progenitor sequence (Rambaut et al. 
2020; Forster et al. 2020; Rambaut et al. 2020; Kumar et al. 
2021; Pekar et al. 2021; Pipes et al. 2021). However, attempts 
to infer this progenitor have been confounded by a perplexing 
fact: the earliest reported sequences from Wuhan are not the 
sequences most similar to SARS-CoV-2’s bat coronavirus rel- 
atives (Pipes et al. 2021). This fact is perplexing because al- 
though the proximal origin of SARS-CoV-2 remains unclear 
(i.e zoonosis versus lab accident), all reasonable explanations 
agree that at a deeper level the SARS-CoV-2 genome is de- 
rived from bat coronaviruses (Lytras et al. 2021). One would 
therefore expect the first reported SARS-CoV-2 sequences to 
be the most similar to these bat coronavirus relatives—but 
this is not the case. 

This conundrum is illustrated in fig. 2, which plots the 
collection date of SARS-CoV-2 sequences in GISAID versus 
the relative number of mutational differences from RaTG13 
(Zhou, Yang, et al. 2020), which is the bat coronavirus with 
the highest full-genome sequence identity to SARS-CoV-2. 
The earliest SARS-CoV-2 sequences were collected in 
Wuhan in December, but these sequences are more distant 
from RaTG13 than sequences collected in January from other 
locations in China or even other countries (fig. 2). The dis- 
crepancy is especially pronounced for sequences from 
patients who had visited the Huanan Seafood Market 
(WHO 2021). All sequences associated with this market differ 
from RaTG13 by at least three more mutations than sequen- 
ces subsequently collected at various other locations (fig. 2). 
Importantly, all these observations also hold true if SARS- 
CoV-2 is compared to other related bat coronaviruses 
(Lytras et al. 2021) such as RpYNO6 (Zhou et al. 2021) or 
RmYNO2 (Zhou, Chen, et al. 2020) rather than RaTG13 (sup- 
plementary fig. S3, Supplementary Material online). 

This conundrum can be visualized in a phylogenetic con- 
text by rooting a tree of early SARS-CoV-2 sequences so that 
the progenitor sequence is closest to the bat coronavirus out- 
group. If we limit the analysis to sequences collected no later 
than January 2020 and remove singleton mutations observed 
only once in the sequence set, then there are three ways to 
root the tree in this fashion since there are three different 
sequences equally close to the outgroup (fig. 3 and 
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Table 1. Samples for which the SARS-CoV-2 sequence could be called at >90% of sites between 21,570 and 29,550, and the substitutions in this 
region relative to the putative SARS-CoV-2 progenitor proCoV2 inferred by Kumar et al. (2021). 


Sample Fraction sites called (21,570-29,550) Patient group Substitutions relative to proCoV2 

A4 0.9266 Early outpatient None 

C1 0.9396 Early outpatient G22081A (A = 924, C = 4, G = 9), 

C28144T (C = 6, T = 1185), 
T29483G (C = 1,G = 45, T=1) 

C2 0.9397 Early outpatient C29095T (C = 1, G = 1, T = 751) 

C9 0.9005 Early outpatient C28144T (C = 3, T = 823), 
G28514T (G = 1, T = 36) 

D9 0.9051 Early outpatient C28144T (C = 4, T = 1653) 

D12 0.9400 Early outpatient C28144T (C = 8, T = 2400) 

E1 0.9223 Early outpatient C28144T (T = 125) 

E5 0.9227 Early outpatient C24034T (A =5, C=3, T =74), 

T26729C (C = 12), G28077C 
(C=142,G=4) 

E11 0.9321 Early outpatient C25460T (C = 2, T = 246), 
C28144T (C = 1, T= 412) 

F11 0.9054 Early outpatient T25304A (A = 9, T = 1), C28144T 
(C= 6, G= 1, T= 1328) 

G1 0.9396 Early outpatient None 

G11 0.9112 Early outpatient None 

H9 0.9381 Early outpatient C28144T (C =2, T = 1254) 

R11 0.9422 Hospital patient (Feb) C21707T (T = 401), C28144T 


(A= 1, C= 18, T = 4265) 


Numbers in parentheses after each substitution give the deep sequencing reads with each nucleotide identity. 


from Huanan Seafood Market 
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Fic. 2. The reported collection dates of SARS-CoV-2 sequences in GISAID versus their relative mutational distances from the RaTG13 bat 
coronavirus outgroup. Mutational distances are relative to the putative progenitor proCoV2 inferred by Kumar et al. (2021), which itself differs 
from RaTG13 by 1,132 mutations—so a sequence with a relative mutational distance of 2 actually has 1,134 differences from RaTG13. Note that the 
lower-right point in the middle (green) panel corresponds to a sequence (Guangdong/FS-30-P00502/2020) reportedly collected in late February 
that is actually two mutations more similar to RaTG13 than proCoV2. The plot shows only sequences in GISAID collected no later than February 
28, 2020. Sequences that the joint WHO-China report (WHO 2021) describes as being associated with the Wuhan Seafood Market are plotted with 
squares. Points are slightly jittered on the y-axis. Go to https://jbloom.github.io/SARS-CoV-2_PRJNA612766/deltadist.html for an interactive 
version of this plot that enables toggling of the outgroup to RDYNO6 and RmYNO2, mouseovers to see details for each point including strain name 
and mutations relative to proCoV2, and adjustment of the y-axis jittering. Static versions of the plot with RBYNO6 and RmYNO2 outgroups are in 
supplementary fig. $3, Supplementary Material online. 


supplementary figs. S4 and S5, Supplementary Material on- 
line). Importantly, none of these rootings place any Huanan 
Seafood Market viruses (or other Wuhan viruses from 
December 2019) in the progenitor node—and only one of 
the rootings has any virus from Wuhan in the progenitor 
node (in the leftmost tree in fig. 3, the progenitor node con- 
tains Wuhan/0126-C13/2020, which was reportedly collected 
on January 26, 2020). Therefore, inferences about the progen- 
itor of SARS-CoV-2 based on comparison to related bat viruses 


are inconsistent with other evidence suggesting that the pro- 
genitor is an early virus from Wuhan (Pipes et al. 2021). 
Several plausible explanations have been proposed for the 
discordance of phylogenetic rooting with evidence that 
Wuhan was the origin of the pandemic. Rambaut et al. 
(2020) suggest that viruses from the clade labeled “B” in 
fig. 3 may just “happen” to have been sequenced first, but 
that other SARS-CoV-2 sequences are really more ancestral as 
implied by phylogenetic rooting. Pipes et al. (2021) discuss the 
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progenitor as USA/WA1/2020 (2020-01-19) 
mutations from proCoV2 (Kumar et al): none 
mutations from Wuhan-Hu-1: C8782T, C18060T, T28144C 


progenitor as Guangdong/HKU-SZ-002/2020 (2020-01-10) 
mutations from proCoV2 (Kumar et al): T18060C, C29095T 
mutations from Wuhan-Hu-1: C8782T, T28144C, C29095T 


MBE 


progenitor as Shandong/LY005-2/2020 (2020-01-24) 
mutations from proCoV2 (Kumar et al): T3171C, T18060C 
mutations from Wuhan-Hu-1: T3171C, C8782T, T28144C 
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Fic. 3. Phylogenetic trees of SARS-CoV-2 sequences in GISAID collected before February 2020. The trees are identical except they are rooted to 
make the progenitor each of the three sequences with highest identity to the RaTG13 bat coronavirus outgroup. Nodes are shown as pie charts 
with areas proportional to the number of observations of that sequence and colored by where the viruses were collected. The mutations on each 
branch are labeled, with mutations toward the nucleotide identity in the outgroup in purple. The labels at the top of each tree give the first known 
virus identical to each putative progenitor, as well as mutations in that progenitor relative to proCoV2 (Kumar et al. 2021) and Wuhan-Hu-1. The 
monophyletic group containing C28144T is collapsed into a node labeled “clade B” in concordance with the naming scheme of Rambaut et al. 
(2020); this clade contains Wuhan-Hu-1. Singleton mutations (mutations observed only once in the sequence set) are removed as described in 
more detail in the Methods. Supplementary figs. S4 and S5, Supplementary Material online, show identical results are obtained if the outgroup is 


RpYNO6 or RMYNO2. 


conundrum in detail and suggest that phylogenetic rooting 
could be incorrect due to technical reasons such as high di- 
vergence of the outgroup or unusual mutational processes 
not captured in substitution models. Kumar et al. (2021) 
agree that phylogenetic rooting is problematic and circum- 
vent this problem by using an alternative algorithm to infer a 
progenitor for SARS-CoV-2 that they name proCoV2. 
Notably, proCoV2 turns out to be identical to one of the 
putative progenitors yielded by my approach in fig. 3 of sim- 
ply placing the root at the nodes closest to the outgroup—a 
perhaps unsurprising result, since both approaches use out- 
groups but ignore sampling dates. It is also possible that there 
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was an early ascertainment bias toward market-associated 
viruses (clade B), since some early case definitions included 
exposure to the Huanan Seafood Market. 

Another explanation that | consider less plausible is offered 
by Garry (2021): that there were multiple zoonoses from dis- 
tinct markets, with the Huanan Seafood Market being the 
source of viruses in clade B, and some other market being 
the source of viruses that lack T8782C and C28144T (fig. 3). 
However, this explanation requires positing zoonoses in two 
markets by two progenitors differing by just two mutations, 
which seems nonparsimonious in the absence of direct evi- 
dence for zoonosis in any market. 
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Sequences Recovered from the Deleted Project and 
Better Annotation of Wuhan-Derived Viruses Help 
Reconcile Inferences about SARS-CoV-2’s Progenitor 
To examine if the sequences recovered from the deleted data 
set help resolve the conundrum described in the previous 
section, | repeated the analyses including those sequences. 
In the process, | noted another salient fact: four GISAID 
sequences collected in Guangdong that fall in a putative pro- 
genitor node are from two different clusters of patients who 
traveled to Wuhan in late December of 2019 and developed 
symptoms before or on the day that they returned to 
Guangdong, where their viruses were ultimately sequenced 
(Chan et al. 2020; Kang, Wu, et al. 2020). Since these patients 
were clearly infected in Wuhan even though they were se- 
quenced in Guangdong, | annotated them separately from 
both the other Wuhan and other China sequences. 

Repeating the analysis of the previous section with these 
changes shows that several sequences from the deleted proj- 
ect and all sequences from patients infected in Wuhan but 
sequenced in Guangdong are more similar to the bat coro- 
navirus outgroup than sequences from the Huanan Seafood 
Market (fig. 4). 

Furthermore, it is immediately apparent that the discrep- 
ancy between outgroup rooting and the evidence that 
Wuhan was the origin of SARS-CoV-2 is alleviated by adding 
the deleted sequences and annotating Wuhan infections se- 
quenced in Guangdong, The rooting of the middle tree in fig. 5 
is now more consistent with the evidence that the pandemic 
originated in Wuhan, as half its progenitor node is derived 
from early Wuhan infections, which is more than any other 
equivalently large node. The first known sequence identical to 
this putative progenitor (Guangdong/HKU-SZ-002/2020) is 
from a patient who developed symptoms on January 4 while 
visiting Wuhan (Chan et al. 2020). This putative progenitor 
has three mutations toward the bat coronavirus outgroup 
relative to Wuhan-Hu-1 (C8782T, T28144C, and C29095T) 
and two mutations relative to proCoV2 (T18060C away 
from the outgroup and C29095T toward the outgroup). 
The leftmost tree in fig. 5, which has a progenitor identical 
to proCoV2 (Kumar et al. 2021), also looks plausible, with 
some weight from Wuhan sequences. However, analysis of 
this rooting is limited by the fact that the defining C18060T 
mutation is in a region not covered in the deleted sequences. 
The rightmost tree in fig. 5 looks less plausible, as it has almost 
no weight from Wuhan and the first sequence identical to its 
progenitor was not collected until January 24. 

We can also qualitatively examine the three progenitor 
placements in fig. 5 using the principle employed by 
Worobey et al. (2020) to help evaluate scenarios for the emer- 
gence of SARS-CoV-2 in Europe and North America: namely 
that during a growing outbreak, a progenitor is likely to give 
rise to multiple branching lineages. This principle is especially 
likely to hold for the scenarios in fig. 5, since there are multiple 
individuals infected with each putative progenitor sequence, 
implying multiple opportunities to transmit descendants 
with new mutations. Using this qualitative principle, the mid- 
dle and leftmost scenarios in fig. 5 seem more plausible than 
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Fic. 4. Relative mutational distance from RaTG13 bat coronavirus 
outgroup calculated only over the region of the SARS-CoV-2 genome 
covered by sequences from the deleted data set (21,570-29,550). 
Because the calculated distances here are only over a portion of the 
genome, there are more negative points than in fig. 2. The plot shows 
sequences in GISAID collected before February 2020, as well as the 13 
early Wuhan epidemic sequences in table 1. Mutational distance is 
calculated relative to proCoV?, and points are jittered on the y-axis. 
Go to https://jbloom.github.io/SARS-CoV-2_PRJNA612766/deltad- 
ist_jitter.html for an interactive version of this plot that enables tog- 
gling the outgroup to RpYNO6 or RMYNO2, mouseovers to see details 
for each point, and adjustment of jittering. 


the rightmost scenario, where the progenitor lacks branching 
descendants. | acknowledge that these arguments are purely 
qualitative and lack the formal statistical analysis of Worobey 
et al. (2020)—but as discussed below, there may be wisdom in 
qualitative reasoning when there are valid concerns about the 
sampling distribution of the underlying data. 


Discussion 


| have identified and recovered a deleted set of partial SARS- 
CoV-2 sequences from the early Wuhan epidemic. Analysis of 
these sequences leads to several conclusions. First, they pro- 
vide further evidence that the Huanan Seafood Market 
sequences that were the focus of the joint WHO-China report 
(WHO 2021) are not representative of all SARS-CoV-2 in 
Wuhan early in the epidemic. The deleted data and existing 
sequences from Wuhan-infected patients hospitalized in 
Guangdong show early Wuhan sequences often carried the 
T29095C mutation and were less likely to carry T8782C/ 
C28144T than sequences in the joint WHO-China report 
(WHO 2021). Second, given current data, there are two 
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progenitor as USA/WA1/2020 (2020-01-19) 
mutations from proCoV2 (Kumar et al): none 
mutations from Wuhan-Hu-1: C8782T, C18060T, T28144C 


progenitor as Guangdong/HKU-SZ-002/2020 (2020-01-10) 
mutations from proCoV2 (Kumar et al): T18060C, C29095T 
mutations from Wuhan-Hu-1: C8782T, T28144C, C29095T 
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progenitor as Shandong/LY005-2/2020 (2020-01-24) 
mutations from proCoV2 (Kumar et al): T3171C, T18060C 
mutations from Wuhan-Hu-1: T3171C, C8782T, T28144C 
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Fic. 5. Phylogenetic trees like those in fig. 3 with the addition of the early Wuhan epidemic sequences from the deleted data set, and Guangdong 
patients infected in Wuhan prior to January 5 annotated separately. Because the deleted sequences are partial, they cannot all be placed 
unambiguously on the tree. Therefore, they are added to each compatible node proportional to the number of sequences already in that 
node. The deleted sequences with C28144T (clade B) or C29095T (putative progenitor in middle tree) can be placed relatively unambiguously 
as defining mutations occur in the sequenced region, but those that lack either of these mutations are compatible with a large number of nodes 


including the proCoV2 putative progenitor. Supplementary figs. $4 and 
identical if RpYN06 or RMYNO2 is instead used as the outgroup. 


plausible identities for the progenitor of all known SARS-CoV- 
2. One is proCoV2 described by Kumar et al. (2021), and the 
other is a sequence that carries three mutations (C8782T, 
T28144C, and C29095T) relative to Wuhan-Hu-1. Crucially, 
both putative progenitors are three mutations closer to 
SARS-CoV-2’s bat coronavirus relatives than sequences 
from the Huanan Seafood Market. Note also that the pro- 
genitor of all known SARS-CoV-2 sequences could still be 
downstream of the sequence that infected patient zero— 
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S5, Supplementary Material online, demonstrate that the results are 


and it is possible that the future discovery of additional early 
SARS-CoV-2 sequences could lead to further revisions of 
inferences about the earliest viruses in the outbreak. 

The fact that this informative data set was deleted suggests 
implications beyond those gleaned directly from the recov- 
ered sequences. Samples from early outpatients in Wuhan are 
a gold mine for anyone seeking to understand spread of the 
virus. Even my analysis of 13 partial sequences is revealing, and 
it clearly would have been more scientifically informative to 
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fully sequence all 34 samples rather than delete the partial 
sequence data. There is no obvious scientific reason for the 
deletion: the sequences are concordant with the samples 
described in Wang et al. (2020), there are no corrections to 
the paper, the paper states human subjects approval was 
obtained, and the sequencing shows no evidence of plasmid 
or sample-to-sample contamination. After | e-mailed the NIH 
the original version of this manuscript, they sent me the e- 
mail requesting deletion of the data, which is in fig. 6. Despite 
the statement in the deletion-request e-mail that the sequen- 
ces were being uploaded to “another website” (fig. 6), | could 
find no evidence that they were actually uploaded to any 
other public website—and certainly they were not posted 
to GISAID, NCBI, or any database used by the joint WHO- 
China report. Therefore, even though the sequencing data 
were listed in a table in the Small paper by Wang et al. 
(2020), the practical consequence of removing the data 
from the SRA was that nobody was aware these sequences 
existed. Notably, it is not possible to delete preprints from 
bioRxiv and medRxiv, so once Wang et al. (2020) had posted 
their preprint, it was permanently committed to the public 


Dear [SRA], 


Thanks for your replay. Yes, | want to withdraw both 2 submissions XXXXX and YYYYY. 
The Bioprojects, Biosamples and all SRA objects should be withdrawn as well 


Best regards, 


[Submitter] 
Wuhan University 


From: NLM Support 

Date: 2020-06-16 21:00 

To: [submitter] 

Subject: Re: SUBXXXXX/subs/sra/SUBXXXXX/overview 

Dear [Submitter], 

Do you want to withdraw all SRA objects created in your account? 
here are 2 submissions XXXX and YYYYY. 

Also, bioprojects and biosamples whould be withdrawn as well, right? 


Best regards, 


[SRA curator] 


If you have any questions or concerns regarding your SRA submission 
please don’t hesitate to contact sra(@ncbi.nlm.nih.gov (applies to new 
questions). We normally respond within 2 business days. 


[SRA curator] 
The NCBI SRA database submission staff 


Summary: Re: SUBXXXXX/subs/sra/SUBXXXXX/overview 


Details: 
Dear Mr/Ms, 


Best regard, 


[Submitter] 
Wuhan University 


Fic. 6. A redacted version of the e-mails from Wuhan University to 
the SRA staff requesting deletion of the sequencing data. This e-mail 
was provided to me by the NIH’s NCBI Director Stephen Sherry on 
June, 19 2021, the day after | e-mailed the NIH an advance copy of this 
manuscript. The redactions and highlighting were done by the NIH, 
and | am showing the e-mail exactly as it was provided to me. 
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record (withdrawn preprints are still accessible, for instance 
see Yang et al. 2020). Particularly in light of the directive that 
labs destroy early samples (Pingui 2020) and multiple orders 
requiring approval of publications on COVID-19 (China CDC 
2020; Kang, Cheng, et al. 2020), the deletion of the data from 
the SRA suggests a less than wholehearted effort to maximize 
the availability of information about viral sequences from 
early in the Wuhan epidemic. 

However, about a month after | publicly posted this man- 
uscript to bioRxiv as a preprint, the deleted sequence data 
were discussed in a July 21, 2021, press conference by the vice 
minister of China’s National Health Commission and subse- 
quently in two blog posts by a China state-affiliated reporter 
(Wang 2021a, 2021b). The press conference and blog posts 
stated that the sequence data were removed from the SRA 
because the journal Small had mistakenly deleted the data 
availability statement in the paper of Wang et al. (2020), 
leading the authors to think that they should also delete 
the actual data. Confusingly, this explanation is inconsistent 
with the reason provided by the authors to the SRA (fig. 6), 
and | am unable to reconcile the discrepancy. The press con- 
ference and blog posts also stated that the sequences were all 
collected on or after January 30, 2020, rather than “early in the 
epidemic’ as originally described in Wang et al. (2020). Finally, 
the press conference and blog noted that on July 8, 2021, the 
deleted data were made available on the Chinese NGDC 
database under accession PRJCA005725. 

Regardless of exactly why the data were deleted, another 
important implication of this study is that genomic epidemi- 
ology studies of early SARS-CoV-2 need to pay as much at- 
tention to the provenance and annotation of the underlying 
sequences as technical considerations. There has been sub- 
stantial scientific effort expended on topics such as phyloge- 
netic rooting (Morel et al. 2021; Pipes et al. 2021), novel 
algorithms (Kumar et al. 2021), and correction of sequencing 
errors (Turakhia et al. 2020). Future studies should devote 
equal effort to going beyond the annotations in GISAID to 
carefully trace the location of patient infection and sample 
sequencing. The potential importance of such work is 
revealed by the observation that many of the sequences clos- 
est to SARS-CoV-2’s bat coronavirus relatives are from early 
patients who were infected in Wuhan but then sequenced in 
and attributed to Guangdong. 

There are several caveats to this study. Most obviously, the 
sequences recovered from the deleted data set are partial and 
lack full metadata. Therefore, it is impossible to unambigu- 
ously place them phylogenetically, or determine exactly when 
they were collected. It is also important to note that my 
phylogenetic analyses use relatively simple methods to 
draw qualitative conclusions without formal statistical test- 
ing. Further application of more advanced methods would be 
a welcome advance. However, qualitative and visual analyses 
do have advantages when the key questions relate more to 
the underlying data than the sophistication of the inferences. 
Finally, both plausible putative progenitors require that an 
early mutation to SARS-CoV-2 was a reversion toward the bat 
coronavirus outgroups (either C18060T or C29095T) on a 
branch that subsequently gave rise to multiple distinct 
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descendants. Such a scenario can only be avoided by invoking 
recombination very early in the pandemic, which is not en- 
tirely implausible for a coronavirus (Boni et al. 2020). 
However, because the outgroups have ~4% nucleotide diver- 
gence from SARS-CoV-2, a mutation toward the outgroup is 
also entirely possible. Of course, future identification of addi- 
tional early sequences could fully resolve these questions. 

More broadly, the approach taken here suggests that it 
may be possible to learn more about the origin or early 
spread of SARS-CoV-2 even without an international inves- 
tigation. | suggest that it could be worthwhile for the NIH to 
review e-mail records to identify other SRA deletions. 
Importantly, SRA deletions do not imply any malfeasance: 
there are legitimate reasons for removing sequencing runs 
and the SRA houses >13-million runs making it infeasible for 
its staff to validate the rationale for all requests. However, the 
current study suggests that at least in one case, the trusting 
structures of science permitted a data deletion that ob- 
scured sequences relevant to the early spread of SARS- 
CoV-2 in Wuhan. A careful re-evaluation of other archived 
forms of scientific communication, reporting, and data could 
shed light on additional overlooked information relevant to 
the early emergence of the virus. 


Materials and Methods 


Code and Data Availability 
The computer code and input data necessary to reproduce all 
analyses described in this paper are available on GitHub at 
https://github.com/jbloom/SARS-CoV-2_PRJNA612766. This 
GitHub repository includes a Snakemake (Mölder et al. 
2021) pipeline that fully automates all steps in the analysis 
except for downloading the sequences from GISAID, which 
must be done manually as described in the GitHub reposi- 
tory’s README to comply with GISAID data sharing terms. 
The deleted SRA files recovered from the Google Cloud are 
all available at https://github.com/jbloom/SARS-CoV-2_ 
PRJNA612766/tree/main/results/sra_downloads. | have suf- 
fixed the file extension.sra to all these files. The consensus 
sequences recovered from these deleted SRA files are linked 
to in the relevant subsection. 


Archiving of Key Weblinks 

| have digitally archived key weblinks in the Wayback 
Machine, including a subset of the SRA files from 
PRJNA612766 on the Google Cloud: 


e The first supplementary table of Farkas et al. (2020) is 
archived at https://web.archive.org/web/ 
20210502130356/ —https://dfzljidn9uc3pi.cloudfront.net/ 
2020/9255/1/Supplementary_Table_1.xlsx. 

SRR11313485: https://storage.googleapis.com/nih-se- 
quence-read-archive/run/SRR11313485/SRR11313485 
SRR11313486: https://storage.googleapis.com/nih-se- 
quence-read-archive/run/SRR11313486/SRR11313486 
SRR11313274: https://storage.googleapis.com/nih-se- 
quence-read-archive/run/SRR11313274/SRR11313274 
SRR11313275: https://storage.googleapis.com/nih-se- 
quence-read-archive/run/SRR11313275/SRR11313275 
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© SRR11313285: https://storage.googleapis.com/nih-se- 
quence-read-archive/run/SRR11313285/SRR11313285 

© SRR11313286: https://storage.googleapis.com/nih-se- 
quence-read-archive/run/SRR11313286/SRR11313286 

© SRR11313448: https://storage.googleapis.com/nih-se- 
quence-read-archive/run/SRR11313448/SRR11313448 

© SRR11313449: https://storage.googleapis.com/nih-se- 
quence-read-archive/run/SRR11313449/SRR11313449 

© SRR11313427: https://storage.googleapis.com/nih-se- 
quence-read-archive/run/SRR11313427/SRR11313427 

© SRR11313429: https://storage.googleapis.com/nih-se- 
quence-read-archive/run/SRR11313429/SRR11313429 


Recovery of SRA Files from Deleted Project 
PRJNA612766 
| parsed the first supplementary table of Farkas et al. (2020) to 
extract the accessions for sequencing runs for deleted SRA 
BioProject PRINA612766. By cross-referencing the samples 
described in this table to Wang et al. (2020), | identified the 
accessions corresponding to the 34 early outpatient samples 
who were positive, as well as the accessions corresponding to 
the 16 hospitalized patient samples from February. Samples 
had both 10min and 4h sequencing runtime accessions, 
which were combined in the subsequent analysis. | also iden- 
tified the samples corresponding to the high-copy plasmid 
controls to enable the analysis of the plasmid sequence to 
rule out contamination. The code used to parse the Excel 
table is available as a Jupyter notebook at https://github.com/ 
jbloom/SARS-CoV-2_PRJNA612766/tree/main/manual_anal- 
yses/PRJNA612766. 

| recovered the SRA files from the Google Cloud by using 
wget to download files with from paths like https://storage. 
googleapis.com/nih-sequence-read-archive/run/ 
SRR11313485/SRR11313485. Note that | cannot guarantee 
that these Google Cloud links will remain active, as my anal- 
yses of other deleted SRA runs (beyond the scope of this 
study) indicate that only sometimes are deleted SRA files still 
available via the Google Cloud. For this reason, key runs have 
been archived on the Wayback Machine as described above, 
and all downloaded SRA files relevant to this study are in- 
cluded in the GitHub repository. Note also that as described 
in this paper’s main text, two SRA files could not be down- 
loaded from the Google Cloud using the aforementioned 
method, and so are not part of this study. 


FASTQ Files for SRR11313490 and SRR11313499 

| was not able to download SRA files for two runs, 
SRR11313490 and SRR11313499, from the Google Cloud. 
After | posted the initial version of this preprint, | was con- 
tacted by three different individuals who had uncovered 
FASTQ or FASTA files for these two missing runs that 
were downloaded from the SRA prior to the deletion of 
PRJNA612766. For revised versions of the manuscript, | in- 
cluded analysis of the data for these two runs that | obtained 
from the links provided at https://github.com/lifebit-ai/ 
SARS-CoV-2/blob/master/assets/ucsc/aws_https_links.txt. 
Inclusion of data for these two runs did not appreciably 
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change the results of the analysis relative to that in the 
original version of the preprint, since both runs corre- 
sponded to low coverage samples for which meaningful viral 
genetic information could not be obtained. 


Alignment of Recovered Reads and Calling of 
Consensus Sequences 

The downloaded SRA files were converted to FASTQ files 
using fasterq-dump from the SRA Toolkit. The FASTQ files 
were preprocessed with fastp (Chen et al. 2018) to trim reads 
and remove low-quality ones (the exact settings using in this 
preprocessing are specified in the Snakemake file in the 
GitHub repository). 

The reads in these FASTQ files were then aligned to a 
SARS-CoV-2 reference genome using minimap2 (Li 2018) 
with default settings. The reference genome used for the en- 
tirety of this study is proCoV2 (Kumar et al. 2021), which was 
generated by making the following three single-nucleotide 
changes to the Wuhan-Hu-1 reference (ASM985889v2) avail- 
able on NCBI: C8782T, C18060T, and T28144C. 

| processed the resulting alignments with samtools and 
pysam to determine the coverage at each site by aligned 
nucleotides with a quality score of at least 20. | also masked 
(set to zero coverage) all sites overlapped by the primers used 
by Wang et al. (2020) to PCR the amplicons used for sequenc- 
ing. These coverage plots are in supplementary figs. S1 and S2, 
Supplementary Material online; the legends of these figures 
also link to interactive versions of the plots that enable zoom- 
ing and mouseovers to get statistics for specific sites. | called 
the consensus sequence at a site if this coverage was > 3% and 
>80% of the reads agreed on the identity. These consensus 
sequences over the entire SARS-CoV-2 genome are available 
at —https://github.com/jbloom/SARS-CoV-2_PRJNAG612766/ 
raw/main/results/consensus/consensus_seqs.csv; note that 
they are mostly N nucleotides since the sequencing approach 
of Wang et al. (2020) only covers part of the genome. 

| only used the recovered consensus sequences in the down- 
stream analyses if it was possible to call the consensus identity 
at >90% of the sites in the region of interest from sites 21,570- 
29,550. These are the sequences listed in table 1, and as de- 
scribed in that table, all mutation calls were at sites with cov- 
erage >10. These sequences in the region of interest (21,570- 
29,550) are available at https://github.com/jbloom/SARS-CoV- 
2_PRJNAG612766/blob/main/results/recovered_seqsa. 


Bat Coronavirus Outgroup Sequences 

For analyses that involved comparisons to SARS-CoV-2’s bat 
coronavirus relatives (Lytras et al. 2021), the bat coronavirus 
sequences were manually downloaded from GISAID (Shu 
and McCauley 2017). The sequences used were RaTG13 
(Zhou, Yang, et al. 2020), RmMYNO2 (Zhou, Chen, et al. 
2020), and RpYNOG (Zhou et al. 2021)—although the mul- 
tiple sequence alignment of these viruses to SARS-CoV-2 also 
contains PrC31 (Li et al. 2021), which was not used in the 
final analyses as it more diverged from SARS-CoV-2 than the 
other three bat coronaviruses at a whole-genome level. The 
GISAID accessions for these sequences are listed at https:// 
github.com/jbloom/SARS-CoV-2_PRJNA612766/blob/main/ 
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data/comparator_genomes_gisaid/accessions.txt, and a ta- 
ble acknowledging the labs and authors is at https:// 
github.com/jbloom/SARS-CoV-2_PRJNA612766/blob/main/ 
data/comparator_genomes_gisaid/acknowledgments.csv. 

Sites in SARS-CoV-2 were mapped to their corresponding 
nucleotide identities in the bat coronavirus outgroups via a 
multiple sequence alignment of proCoV2 to the bat corona- 
viruses generated using mafft (Katoh and Standley 2013). 


Curation and Analysis of Early SARS-CoV-2 Sequences 
from GISAID 

For the broader analyses of existing SARS-CoV-2 sequences, 
| downloaded all sequences collected prior to March of 
2020 from GISAID. The accessions of these sequences are 
at —https://github.com/jbloom/SARS-CoV-2_PRJNA612766/ 
blob/main/data/gisaid_sequences_through_Feb2020/acces- 
sions.txt, and a table acknowledging the labs and authors 
is at https://github.com/jbloom/SARS-CoV-2_PRJNA612766/ 
blob/main/data/gisaid_sequences_through_Feb2020/ 
acknowledgments.csv. 

I then used mafft (Katoh and Standley 2013) to align these 
sequences to the proCoV2 reference described above, 
stripped any sites that were gapped relative to the reference, 
and filtered the sequences using the following criteria: 


© | removed any sequences collected after February 28, 
2020. 

© | removed any sequences that had >4 mutations within 
any 10-nucleotide stretch, as such runs of mutations of- 
ten indicate sequencing errors. 

© | removed any sequence for which the alignment covered 
<90% of the proCoV2 sequence. 

© | removed any sequence with >15 mutations relative to 
the reference. 

èe | removed any sequence with >5,000 ambiguous 
nucleotides. 


| then annotated the sequences using some additional 
information. First, | annotated sequences based on the 
joint WHO-China report (WHO 2021) and also Zhu et al. 
(2020) to keep only one representative from multiply se- 
quenced patients and to indicate which sequences were 
from patients associated with the Huanan Seafood Market. 
My version of these annotations is at https://github.com/ 
jbloom/SARS-CoV-2_PRJNA612766/blob/main/data/WHO_ 
China_Report_Dec2019_cases.yaml. Next, | identified some 
sequences in the set that were clearly duplicates from 
the same patient, and removed these. The annotations 
used to remove these duplicates are at https://github.com/ 
jbloom/SARS-CoV-2_PRJNA612766/blob/main/data/seqs_ 
to_excludeyaml. Finally, | used information from Chan et al. 
(2020) and Kang, Wu, et al. (2020) to identify patients 
who were infected in Wuhan before January 5, 2020, but 
ultimately sequenced in Guangdong: these annotations are 
at https://github.com/jbloom/SARS-CoV-2_PRJNA612766/ 
blob/main/data/Wuhan_exports.yaml. 

| next removed any of the handful of mutations noted by 
Turakhia et al. (2020) to be lab artifacts that commonly afflict 
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SARS-CoV-2 sequences. | also limited the analyses to the re- 
gion of the genome that spans from the start of the first 
coding region (ORF1ab) to the end of the last (ORF10) be- 
cause | noticed that some sequences had suspicious patterns 
(such as many mutations or runs of mutations) near the 
termini of the genome. 

The plot in fig. 2 contains all of the GISAID sequences after 
this filtering. The plot in fig. 4 shows the filtered GISAID 
sequences collected before February of 2020 plus the 13 
good coverage recovered partial early outpatient sequences 
(table 1), considering only the region covered by the partial 
sequences (21,570—29,550). 


Phylogenetic Analyses 
The phylogenetic trees were inferred using the GISAD 
sequences after the filtering and annotations described above, 
only considering sequences with >95% coverage over the 
region of interest that were collected before February of 
2020. In addition, after generating this sequence set | removed 
any mutations from sequences that were singletons in the 
sense that they were only observed once in the sequence 
set—a similar approach was used by Kumar et al. (2021), 
and this approach should eliminate rare mutations that could 
be sequencing errors. A file indicating the unique sequences 
used for the phylogenetic analysis, their mutations relative to 
proCoV2, and other sequences in that cluster is at https:// 
github.com/jbloom/SARS-CoV-2_PRJNA612766/blob/main/ 
results/phylogenetics/all_alignment.csv. 

| then used IQ-Tree (Minh et al. 2020) to infer a 
maximum-likelihood phylogenetic tree using a GTR nucleo- 
tide substitution model with empirical nucleotide frequen- 
cies, and collapsing zero-length branches to potentially allow 
a multifurcating tree. The inference yielded the tree topology 
and branch lengths shown in all figures in this study with 
phylogenetic trees. | then rendered the images of the tree 
using ETE 3 (Huerta-Cepas et al. 2016), manually rerooting 
the tree to place the first (progenitor) node at each of the 
three nodes that have the highest identity to the bat coro- 
navirus outgroup. In these images, node sizes are propor- 
tional to the number of sequences in that node and are 
colored in proportion to the location from which those 
sequences are derived. As indicated in the legend to fig. 3, 
the node containing the monophyletic set of sequences with 
C28144T is collapsed into a single node in the tree images. 

For the trees in which | added the recovered sequences 
from the deleted data set (fig. 5), the actual trees are exactly 
the same as those inferred using the GISAID sequences above. 
The difference is that the sequences from the deleted data set 
are then added to each node with which they are compatible 
given their mutations in an amount proportional to the size 
of the node, the logic being that a sequence is more likely to 
fall into larger clusters. 


Interactive Versions of Some Figures 
Interactive versions of some figures are available at https:// 
jbloom.github.io/SARS-CoV-2_PRJNA612766/ and were cre- 
ated using Altair (VanderPlas et al. 2018). 
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Supplementary Material 


Supplementary data are available at Molecular Biology and 
Evolution online. 
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